Modeling Content Identification from Document Images

Author

  • Takehiro Nakayama
Abstract

A new technique to locate content-representing words for a given document image using an abstract representation of character shapes is described. A character shape code representation defined by the location of a character in a text line has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character codes (e.g., ASCII), the mapping is ambiguous. In this paper, the ambiguity is shown to be practically limited to an acceptable level. It is illustrated that: first, punctuation marks are clearly distinguished from the other characters; second, stop words are generally distinguishable from other words, because the permutations of character shape codes in function words are characteristically different from those in content words; and third, numerals and acronyms in capital letters are distinguishable from other words. With these classifications, potential content-representing words are identified, and an analysis of their distribution yields their rank. Consequently, introducing character shape codes makes it possible to inexpensively and robustly bridge the gap between electronic documents and hard-copy documents for the purpose of content identification.
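The idea of a shape code that depends only on where a character sits in the text line can be illustrated with a minimal sketch. The class set used here (`A` for characters rising above the x-height, `x` for characters confined to the x-height zone, `g` for descenders, `i` and `j` for dotted letters) is an assumption made for illustration; the abstract does not list the paper's actual code set. The function names `shape_code`, `word_shape`, and `ambiguity_classes` are likewise hypothetical, and the sketch works from text rather than from an image.

```python
from collections import defaultdict

# Hypothetical shape classes for illustration only; the abstract does not
# specify the actual code set used in the paper.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch: str) -> str:
    """Map one character to a coarse class based on its vertical extent."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # reaches above the x-height
    if ch in DESCENDERS:
        return "g"  # reaches below the baseline
    if ch == "i":
        return "i"  # x-height body with a detached dot
    if ch == "j":
        return "j"  # descender with a detached dot
    if ch.islower():
        return "x"  # confined between baseline and x-height
    return ch       # punctuation kept as-is (a simplification)


def word_shape(word: str) -> str:
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def ambiguity_classes(words):
    """Group words that collapse to the same shape-code string."""
    groups = defaultdict(list)
    for w in words:
        groups[word_shape(w)].append(w)
    return dict(groups)


print(word_shape("the"))   # AAx
print(word_shape("OCR"))   # AAA
print(word_shape("1994"))  # AAAA
print(ambiguity_classes(["the", "She", "page"]))
```

Under this assumed mapping, "the" and "She" collapse to the same string, which is the kind of ambiguity the abstract refers to, while all-capital acronyms and numerals produce strings made only of `A`.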

Similar resources

Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions. Both deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by a dynamic local connectivity map and then applying a l...

FONT DISCRIMINATION USING FRACTAL DIMENSIONS

One of the related problems of OCR systems is discrimination of fonts in machine printed document images. This task improves performance of general OCR systems. Proposed methods in this paper are based on various fractal dimensions for font discrimination. First, some predefined fractal dimensions were combined with directional methods to enhance font differentiation. Then, a novel fractal dime...

Transition Potential Modeling of Land-Cover based on Similarity Weighted Instance-based Learning Procedure and Its Implication in the REDD Project Design Document

Reducing Emissions from Deforestation and Forest Degradation (REDD) is a climate change mitigation strategy employed to reduce the intensity of deforestation and GHG emissions. In recent decades, drastic land use changes in Mazandaran province caused a substantial reduction in the amount of Hyrcanian forests. The present research based on objectives of REDD projects paid to identify of fore...

A Qualitative Study Analyzing How Health-Sector Experts Use Medical Images

Introduction: In the health sector, an image functions as a form of document that can convey a considerable amount of information. Employing this type of information can increase the effectiveness of the performance of medical experts. This study aimed to survey how health experts use medical images in their practice. Methods: This applied qualitative study was carried out in 1392 (2013). The study p...

Removing Geometric Distortion of Text Using Geometric Information of Text Lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from such a document image using OCR is subject to errors. In this paper we propose a novel approach to significantly remove geometric distortion from document images. In this method, we first extract document lines from do...

Publication date: 1994